{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "name": "CIS 522 Week 11 Homework.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true,
      "include_colab_link": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/github/CIS-522/course-content/blob/main/tutorials/W11_DeepRL/W11_Homework.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bOJhz3372i7x"
      },
      "source": [
        "# CIS-522 Week 11 Homework\n",
        "\n",
        "\n",
        "**Instructor:** Dinesh Jayaraman\n",
        "\n",
        "**Content Creator:** Chuning Zhu"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "d1GTaygz4aqK",
        "cellView": "form"
      },
      "source": [
        "#@markdown What is your Pennkey and pod? (text, not numbers, e.g. bfranklin)\n",
        "my_pennkey = '' #@param {type:\"string\"}\n",
        "my_pod = 'Select' #@param ['Select', 'euclidean-wombat', 'sublime-newt', 'buoyant-unicorn', 'lackadaisical-manatee','indelible-stingray','superfluous-lyrebird','discreet-reindeer','quizzical-goldfish','astute-jellyfish','ubiquitous-cheetah','nonchalant-crocodile','fashionable-lemur','spiffy-eagle','electric-emu','quotidian-lion']\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-4CV3znild_L"
      },
      "source": [
        "# Section 1: Reading\n",
        "\n",
        "For this week's homework assignment, you will read the [Rainbow DQN paper](https://arxiv.org/abs/1710.02298). Rainbow combines multiple extensions to the DQN algorithm. You will **read the first 3 pages** of the paper, which gives an overview of the DQN algorithm and its extensions. Choose one extension that interests you the most and write a short paragraph explaining what problem it addresses and how it works. Then, go through the corresponding ablation studies (page 6) and describe its ablation results. Feel free to dig further by reading the original paper that proposed the extension or finding online blogposts/articles. For your reference, here's a list of extensions covered in Rainbow:\n",
        "\n",
        "- Double Q-learning\n",
        "- Prioritized replay\n",
        "- Dueling networks\n",
        "- Multi-step learning\n",
        "- Distributional RL\n",
        "- Noisy Nets"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "wK2pZzfhotZu",
        "cellView": "form"
      },
      "source": [
        "#@markdown Which extension to DQN did you choose?\n",
        "extension = \"Double Q-learning\" #@param [\"Double Q-learning\", \"Prioritized replay\", \"Dueling networks\", \"Multi-step learning\", \"Distributional RL\", \"Noisy Nets\"]\n",
        "#@markdown What problem of DQN does it address? How does it work?\n",
        "explanation = \"\" #@param {type:\"string\"}\n",
        "#@markdown Describe its ablation results.\n",
        "ablation = \"\" #@param {type:\"string\"}\n",
        "try:t1;\n",
        "except NameError: t1 = time.time()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SXXDe9Hmld6O"
      },
      "source": [
        "# Section 2: Coding"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "V1mBr-Tgpyi5"
      },
      "source": [
        "## Setup"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "cyR0YxgAJYte",
        "cellView": "form"
      },
      "source": [
        "# @title Install\n",
        "!apt install -q xvfb\n",
        "!pip install -q pyvirtualdisplay"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "gU8GuZNjppq4",
        "cellView": "form"
      },
      "source": [
        "# @title Imports\n",
        "import time\n",
        "import gym\n",
        "import torch\n",
        "import torch.nn as nn\n",
        "import torch.optim as optim"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pFm59_SsH-bf",
        "cellView": "form"
      },
      "source": [
        "# @title Figure settings\n",
        "import matplotlib.pyplot as plt\n",
        "import matplotlib.animation as animation\n",
        "import ipywidgets as widgets\n",
        "\n",
        "%matplotlib inline \n",
        "%config InlineBackend.figure_format = 'retina'\n",
        "\n",
        "plt.rcParams.update(plt.rcParamsDefault)\n",
        "plt.rc('animation', html='jshtml')\n",
        "\n",
        "from IPython import display\n",
        "from pyvirtualdisplay import Display\n",
        "\n",
        "d = Display()\n",
        "d.start();"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_-c_lxRdp8Uy"
      },
      "source": [
        "## Implementing a simple DQN\n",
        "\n",
        "In this section, you will code up a \"vanilla\" DQN without the bells and whistles to solve a simple task from OpenAI Gym. The goal is to get some hands-on experience with the Gym interface and writing a reinforcement learning training loop from scratch.\n",
        "\n",
        "First, read through [Getting Started with Gym](https://gym.openai.com/docs/) to familiarize yourself with the OpenAI Gym library, and in particular its interface. Then, find the state dimension and the number of actions for the [MountainCar](https://gym.openai.com/envs/MountainCar-v0/) environment. Explain what the meaning of its states and actions.\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_VI6wzqCzsJQ"
      },
      "source": [
        "env = gym.make('MountainCar-v0')\n",
        "state_dim = ...\n",
        "n_actions = ...\n",
        "print(f'state_dim = {state_dim}, n_actions = {n_actions}')"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pSzFFGEF21by",
        "cellView": "form"
      },
      "source": [
        "#@markdown What is the state dimension?\n",
        "state_dimension = 0 #@param\n",
        "#@markdown How many actions are there in this environment?\n",
        "num_actions = 0 #@param\n",
        "#@markdown Explain what the states and actions correspond to in this environment.\n",
        "meaning = \"\" #@param {type:\"string\"}\n",
        "\n",
        "try:t2;\n",
        "except NameError: t2 = time.time()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "e1Y_LNwBc0jP"
      },
      "source": [
        "The follow cell defines the QNetwork class. The Q network accepts a state vector as input and outputs the Q-values for all actions given the current state. Fill in missing parameters with your answers to the previous questions. "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aQIROG7b5tp5"
      },
      "source": [
        "class QNetwork(nn.Module):\n",
        "    def __init__(self):\n",
        "        super().__init__()\n",
        "        # TODO: Insert answers from previous questions\n",
        "        self.state_dim = ...\n",
        "        self.n_actions = ...\n",
        "        self.n_hidden = 100\n",
        "\n",
        "        self.layers = nn.Sequential(\n",
        "            nn.Linear(self.state_dim, self.hidden, bias=False),\n",
        "            nn.Tanh(),\n",
        "            nn.Linear(self.hidden, self.hidden, bias=False),\n",
        "            nn.Tanh(),\n",
        "            nn.Linear(self.hidden, self.n_actions, bias=False)\n",
        "        )\n",
        "\n",
        "    def forward(self, state):\n",
        "        return self.layers(state)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "C78pmKgT9GEf"
      },
      "source": [
        "Follow the instructions to complete epsilon-greedy policy and train_DQN function. Note that the raw reward of the environment is -1 for each timestep. To make learning easier, we define our own reward which considers the current position of the car. After obtaining the next state by executing an action, feed it to the provided reward_shaping function to get the shaped reward.\n",
        "\n",
        "Recall that the loss of the Q network can be computed as $$\\ell(\\phi, s, a, r, s') = \\left[Q_{\\phi}(s, a) - \\left(r + \\gamma\\max_{a'} Q_{\\phi}(s', a')\\right)\\right]^2$$ and we update the Q network by performing a gradient-descent step $$\\phi \\leftarrow \\phi - \\alpha\\nabla_{\\phi}\\ell(\\phi, s, a, r, s')$$"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "fSI1BNTr46jN"
      },
      "source": [
        "def reward_shaping(next_state):\n",
        "    # Reward is position + 0.5\n",
        "    reward = next_state[0] + 0.5\n",
        "    # If goal is reached, give additional reward\n",
        "    if next_state[0] >= 0.5:\n",
        "        reward += 1\n",
        "    return reward\n",
        "\n",
        "\n",
        "def epsilon_greedy(Q_values, epsilon):\n",
        "    # TODO: select an action using epsilon-greedy policy\n",
        "    # Generate a random number between 0 and 1. If it is less than epsilon, then\n",
        "    # sample a random action from the action space. Otherwise select the action\n",
        "    # with the maximum Q-value.\n",
        "    pass\n",
        "\n",
        "\n",
        "def train_DQN(env, q_net, criterion, optimizer, scheduler, epsilon, discount, decay_rate, max_episodes):\n",
        "    successes = []\n",
        "    rewards = []\n",
        "    losses = []\n",
        "    ep_steps = 200\n",
        "    for episode in range(max_episodes):\n",
        "        ep_reward = 0\n",
        "        ep_loss = 0\n",
        "        state = env.reset()\n",
        "        for s in range(ep_steps):\n",
        "            # TODO: complete training step\n",
        "\n",
        "            # Step 1: forward state through q_net to get q values. \n",
        "\n",
        "\n",
        "            # Step 2: choose action using epsilon_greedy.\n",
        "\n",
        "\n",
        "            # Step 3: execute action in environment to get the next state.\n",
        "            # Then use reward_shaping to get the shaped reward for training.\n",
        "            # For logging purposes, do not change the provided variable names.\n",
        "            next_state, _, done, _ = ...\n",
        "            reward = ...\n",
        "\n",
        "\n",
        "            # Step 4: create Q_target from (s, a, r, s'). To do this, make a \n",
        "            # copy of the current Q values and only update the entry corresponding \n",
        "            # to the current action. Use Q.clone().detach() to make the copy.\n",
        "\n",
        "\n",
        "            # Step 5: compute loss using Q and Q_target and take one gradient \n",
        "            # step. Be sure to clear optimizer gradients before backpropagation.\n",
        "\n",
        "            \n",
        "            # Step 6: update state to next state.\n",
        "\n",
        "\n",
        "            # DO NOT MODIFY CODE BELOW\n",
        "            ep_reward += reward\n",
        "            ep_loss += loss.item()\n",
        "\n",
        "            if done: \n",
        "                # Determine whether or not this is a success\n",
        "                ep_success = float(next_state[0] >= 0.5)\n",
        "                if ep_success:\n",
        "                    # Decay the exploration parameter if we see success\n",
        "                    epsilon = epsilon * decay_rate\n",
        "                    # Step learning rate schduler\n",
        "                    scheduler.step()\n",
        "                \n",
        "                # Print training info\n",
        "                if (episode + 1) % 100 == 0:\n",
        "                    print(f'Episode: {episode + 1}, reward: {ep_reward}, success: {ep_success}')\n",
        "\n",
        "                # Logging\n",
        "                successes.append(ep_success)\n",
        "                rewards.append(ep_reward / ep_steps)\n",
        "                losses.append(ep_loss / ep_steps)\n",
        "                break\n",
        "    \n",
        "    return successes, rewards, losses     "
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BZkislgATqjD"
      },
      "source": [
        "Run the following cell to train your network."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pUQBkECoBzoO"
      },
      "source": [
        "# Set seeds for reproducibility\n",
        "env.seed(1); torch.manual_seed(1);\n",
        "env.reset()\n",
        "\n",
        "# Hyperparameters. \n",
        "discount = 0.99         # environment discount factor\n",
        "max_episodes = 1000     # number of episodes to train for\n",
        "learning_rate = 0.001   # learning rate for optimizer\n",
        "epsilon = 0.3           # episilon for greedy exploration\n",
        "decay_rate = 0.95       # decay rate for epsilon\n",
        "\n",
        "# Define network\n",
        "q_net = QNetwork()\n",
        "criterion = nn.MSELoss()\n",
        "optimizer = optim.SGD(q_net.parameters(), learning_rate)\n",
        "scheduler = optim.lr_scheduler.StepLR(optimizer, 1, 0.9)\n",
        "\n",
        "# Train network\n",
        "successes, rewards, losses = train_DQN(env, q_net, criterion, optimizer, scheduler, epsilon, discount, decay_rate, max_episodes)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VoR13JRNHhFa",
        "cellView": "form"
      },
      "source": [
        "# @title Run cell to plot success and average true reward\n",
        "plt.figure()\n",
        "plt.plot(successes)\n",
        "plt.title('Success')\n",
        "plt.show()\n",
        "plt.plot(rewards)\n",
        "plt.title('Average reward per step')\n",
        "plt.show()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "yHxhNIbeBQR4",
        "cellView": "form"
      },
      "source": [
        "# @title Run cell to visualize learned agent\n",
        "def visualize(env, q_net, ep_steps):\n",
        "    # Simulate agent in environment\n",
        "    frames = []\n",
        "    state = env.reset()\n",
        "    for s in range(ep_steps):\n",
        "        Q = q_net(torch.tensor(state, dtype=torch.float))\n",
        "        action = epsilon_greedy(Q, epsilon)\n",
        "        next_state, _, done, _ = env.step(action)\n",
        "\n",
        "        display.clear_output(wait=True)\n",
        "        frames.append(env.render('rgb_array'))\n",
        "        state = next_state\n",
        "        if done:\n",
        "            break\n",
        "    \n",
        "    # Generate video\n",
        "    plt.figure(figsize=(frames[0].shape[1] / 72.0, frames[0].shape[0] / 72.0), dpi=72)\n",
        "    patch = plt.imshow(frames[0])\n",
        "    plt.axis('off')\n",
        "    animate = lambda i: patch.set_data(frames[i])\n",
        "    anim = animation.FuncAnimation(plt.gcf(), animate, frames=range(len(frames)), interval=50)\n",
        "    plt.close() # avoid showing extra plots\n",
        "    return anim\n",
        "    \n",
        "visualize(env, q_net, 200)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "62ngFhoC_fiS"
      },
      "source": [
        "# Section 3: Ethics\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "txM-spAoLR9r"
      },
      "source": [
        "## Part I: Societal Context of DRL: Perspective from Practitioners & Scholars\n",
        " \n",
        "As the practical applications of Deep Reinforcement Learning continue to grow, practitioners and scholars are raising awareness of the potential societal and ethical risks and challenges that must be addressed.  To explore these issues, we will be reading excerpts from [The Societal Implications of Deep Reinforcement Learning](https://www.jair.org/index.php/jair/article/view/12360/26667) published in the *Journal of Artificial Intelligence Research*.\n",
        " \n",
        "Read “Section 3: Challenges DRL Raises for Society, Ethics, and Governance,” which enumerates six specific areas of concern.  Select one of the concerns the authors identified and describe it below. Offer your analysis of why it is an important issue.\n",
        " "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "id": "F7zJ_NgePpIG"
      },
      "source": [
        "area_of_concern = \"\" #@param {type:\"string\"}\n",
        "\n",
        "try:t3;\n",
        "except NameError: t3 = time.time()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZxnNnr2rL0MF"
      },
      "source": [
        "Now, complete your reading of “Section 4: Avenues of Progress in DRL and their Implications,” “Section 5. Discussion” and “Section 6: Summary and Conclusion.”  Do the authors suggest any solutions to the concern you discussed above?  If so, describe their approach.  If not, do you have your own recommendation?\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "id": "O2DCnQJpL0_R"
      },
      "source": [
        "solution_to_concern = \"\" #@param {type:\"string\"}\n",
        "\n",
        "try:t4;\n",
        "except NameError: t4 = time.time()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "B9LYjZEePPC0"
      },
      "source": [
        "## Part II: Stakeholder Perspective: A Proposed Multi-Factor Analysis for Governmental Use of Autonomous Systems\n",
        " \n",
        "Next, consider the broader real-world uses of autonomous systems from the perspective of various users, spanning from corporate entities to government agencies.  Government agencies face particular ethical and legal challenges when confronted with the choice between adopting an artificial intelligence system or continuing to have humans making the decisions.\n",
        " \n",
        "Read Part IV and the Conclusion of [“A Framework for Governmental Use of Machine Learning”](https://www.acus.gov/sites/default/files/documents/Coglianese%20ACUS%20Final%20Report%20w%20Cover%20Page.pdf), produced by Professor Cary Coglianese, University of Pennsylvania Law School, for the Administrative Conference of the United States (ACUS) in December 2020, which analyses the uses and implications of AI systems in the government and proposes a framework for public officials to use in deciding when to adopt AI tools. \n",
        " \n",
        "The report proposes a **multifactor analysis method** for government agencies to choose between AI and the status quo. Write a response addressing the following questions. \n",
        "- Considering your prior reading on the societal implications of DRL, is this multifactored method effective when agencies are faced with the choice of whether or not to adopt an autonomous system? \n",
        "- Are the prongs proposed in the report sufficient to address the issue you analyzed above? Why or why not? \n",
        "- Are these consistent with the proposals of the authors in “The Societal Implications of Deep Reinforcement Learning”? If not, which one do you think would better address the societal concerns arising from DRL? Would you propose a different approach?  \n",
        "- In what ways do DRL systems introduce additional concerns than other autonomous systems?"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VruyAgVaQMTE",
        "cellView": "form"
      },
      "source": [
        "part_2_response = \"\" #@param {type:\"string\"}\n",
        "\n",
        "try:t5;\n",
        "except NameError: t5 = time.time()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "leOYo_hkiFYE"
      },
      "source": [
        "# Submission\n",
        "\n",
        "Once you're done, click on 'Share' and add the link to the box below. If you did not use CoLab, you can also upload the file or notebook in the form below."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "id": "W2QS_LshiVst"
      },
      "source": [
        "link = '' #@param {type:\"string\"}"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "cellView": "form",
        "id": "hprJKnQDiZDo"
      },
      "source": [
        "import time\n",
        "import numpy as np\n",
        "import urllib.parse\n",
        "from IPython.display import IFrame\n",
        "\n",
        "\n",
        "#@markdown #Run Cell to Show Airtable Form\n",
        "#@markdown ##**Confirm your answers and then click \"Submit\"**\n",
        "\n",
        "\n",
        "def prefill_form(src, fields: dict):\n",
        "  '''\n",
        "  src: the original src url to embed the form\n",
        "  fields: a dictionary of field:value pairs,\n",
        "  e.g. {\"pennkey\": my_pennkey, \"location\": my_location}\n",
        "  '''\n",
        "  prefill_fields = {}\n",
        "  for key in fields:\n",
        "      new_key = 'prefill_' + key\n",
        "      prefill_fields[new_key] = fields[key]\n",
        "  prefills = urllib.parse.urlencode(prefill_fields)\n",
        "  src = src + prefills\n",
        "  return src\n",
        "\n",
        "\n",
        "#autofill fields if they are not present\n",
        "#a missing pennkey and pod will result in an Airtable warning\n",
        "#which is easily fixed user-side.\n",
        "try: my_pennkey;\n",
        "except NameError: my_pennkey = \"\"\n",
        "try: my_pod;\n",
        "except NameError: my_pod = \"Select\"\n",
        "try: extension;\n",
        "except NameError: extension = \"Double Q-learning\"\n",
        "try: explanation;\n",
        "except NameError: explanation = \"\"\n",
        "try: ablation;\n",
        "except NameError: ablation = \"\"\n",
        "try: state_dimension;\n",
        "except NameError: state_dimension = 0\n",
        "try: num_actions;\n",
        "except NameError: num_actions = 0\n",
        "try: meaning;\n",
        "except NameError: meaning = \"\"\n",
        "try: area_of_concern;\n",
        "except NameError: area_of_concern = \"\"\n",
        "try: solution_to_concern;\n",
        "except NameError: solution_to_concern = \"\"\n",
        "try: part_2_response;\n",
        "except NameError: part_2_response = \"\"\n",
        "try: link;\n",
        "except NameError: link = \"\"\n",
        "\n",
        "fields = {\"pennkey\": my_pennkey,\n",
        "          \"pod\": my_pod,\n",
        "          \"extension\": extension,\n",
        "          \"explanation\": explanation,\n",
        "          \"ablation\": ablation,\n",
        "          \"state_dimension\": state_dimension,\n",
        "          \"num_actions\": num_actions,\n",
        "          \"meaning\": meaning,\n",
        "          \"area_of_concern\": area_of_concern,\n",
        "          \"solution_to_concern\": solution_to_concern,\n",
        "          \"part_2_response\": part_2_response,\n",
        "          \"link\": link}\n",
        "\n",
        "src = \"https://airtable.com/embed/shrb6cgGnu17S8MhJ?\"\n",
        "\n",
        "\n",
        "#now instead of the original source url, we do: src = prefill_form(src, fields)\n",
        "display.display(IFrame(src = prefill_form(src, fields), width = 800, height = 400))"
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}
